FIELD OF THE INVENTION

The present invention relates to systems and methods for multimedia processing. For
example, the present invention provides systems and methods for receiving spoken audio,
converting the spoken audio to text, and transferring the text to a user. As desired, the speech or
text can be translated into one or more different languages. Systems and methods for real-time 
conversion and transmission of speech and text are provided, including systems and methods for 
large scale processing of multimedia events. 

BACKGROUND OF THE INVENTION 

The Internet has revolutionized the way that information is delivered and business is done.
In June of 1999, Nielsen/NetRatings reported that there were a total of 63.4 million active Internet
users in the United States, and 105.4 million total Internet users with Internet access. The average
user spent 7 hours, 38 minutes on-line that month. Furthermore, the year-to-year growth rate in
users is expected to be in the range of 15% to 25%. Worldwide, it is expected that there will be
greater than 250 million residential users, and greater than 200 million corporate users by the
year 2005.

In the last few years, improvements in software and hardware have allowed the Internet to
be used on a large scale for the transmission of audio and video. Such improvements include the
availability of real-time streaming audio and video. Numerous media events are now "broadcast"
live over the Internet, allowing users to see and hear speeches, music events, and other artistic
performances. With further increases in speed, the Internet promises to be the primary method for
transmitting and receiving multimedia information. Present real-time applications, however, are
limited in their flexibility and usefulness. For example, many real-time audio and video
applications do not permit users to edit or otherwise manipulate the content. The art is in need of
new systems and methods for expanding the usefulness and flexibility of multimedia information
flow over electronic communication systems.

SUMMARY OF THE INVENTION

The present invention relates to systems and methods for multimedia processing. For 
example, the present invention provides systems and methods for receiving spoken audio, 
converting the spoken audio to text, and transferring the text to a user. As desired, the speech or 
text can be translated into one or more different languages. Systems and methods for real-time
conversion and transmission of speech and text are provided. 

For example, the present invention provides Web-enabled systems comprising audio-to- 
text captioning capabilities, audio conference bridging, text-to-speech conversion, foreign 
language translation, web media streaming, and voice-over-IP integrated with processing and 
software capabilities that provide streaming text and multimedia information to viewers in a
number of formats including interactive formats. 

The present invention also provides foreign translation systems and methods that provide
end-to-end audio transcription and language translation of live events (i.e., from audio source to
intended viewer), streamed over an electronic communication network. Such systems and
methods include streaming text of the spoken word, a complete accumulative transcript, the ability
to convert text back into audio in any desired language, and handling of comments/questions
submitted by viewers of the multimedia information (e.g., returned to each viewer in their selected
language). In some embodiments, text streaming occurs through independent encoded media
streaming (e.g., separate IP ports). The information is provided in any desired format (e.g.,
MICROSOFT, REAL, QUICKTIME, etc.). In some embodiments, real-time translations are
provided in multiple languages simultaneously or concurrently (e.g., each viewer selects or
changes their preferred language during the event).

The present invention also provides audio to text conversion with high accuracy in short
periods of time. For example, the present invention provides systems and methods for
transcription of live events to 95-98% accuracy, and transcription of any event to 100% accuracy
within a few hours of event completion.

The systems and methods of the present invention may be applied to interactive formats
including talk-show formats. For example, as described in more detail below, in some
embodiments, the systems and methods of the present invention provide an electronic re-creation
of the television talk-show model over the web without requiring the participants to use or own
any technology beyond a telephone and a web connected device (e.g., a personal computer).
Talk-show participation by invited guests or debatees may be conducted through the web. In some
embodiments, the systems and methods employ web-based moderator and participant controls
and/or web-based call-in "screener" controls. In some embodiments, viewer interaction is handled
via email, a comment/question queue maintained by a database, and/or phone call-ins. In some
preferred embodiments of the present invention, real-time language translation in multiple
languages is applied to allow participation of individuals, independent of their language usage.
Streaming multimedia information provided in the interactive format includes, as desired,
graphical or video slides, images, and/or video.

The present invention further provides systems and methods for complete re-creation of the 
classroom teaching model, including live lectures (audio and video), presentation slides, slide 
notes, comments/questions (via email, chat, and/or live call-ins), streaming transcript/foreign
translations, complete lecture transcript, streaming videos, and streaming PC screen capture demos 
with audio voice-over. 

For use in such applications, the present invention provides a system comprising a
processor, said processor configured to receive multimedia information and encode a plurality of
information streams comprising a separately encoded first information stream and a separately
encoded second information stream from the multimedia information, said first information stream
comprising audio information and said second information stream comprising text information
(e.g., text transcript information generated from the audio information). The present invention is
not limited by the nature of the multimedia information. Multimedia information includes, but is
not limited to, live event audio, televised audio, speech audio, and motion picture audio. In some
embodiments, the multimedia information comprises information from a plurality of distinct
locations (e.g., distinct geographic locations).

In some embodiments, the system further comprises a speech to text converter, wherein the
speech to text converter is configured to produce text from the multimedia information and to
provide the text to the processor. The present invention is not limited by the nature of the speech
to text converter. In some embodiments, the speech to text converter comprises a stenograph (e.g.,
operated by a stenographer). In other embodiments, the speech to text converter comprises voice
recognition software. In preferred embodiments, the speech to text converter comprises an error
corrector configured to confirm text accuracy prior to providing the text to the processor.

In some embodiments, the processor further comprises a security protocol. In some
preferred embodiments, the security protocol is configured to restrict participants and viewers
from controlling the processor (e.g., a password protected processor). In other embodiments, the
system further comprises a resource manager (e.g., configured to monitor and maintain efficiency
of the system).

In some embodiments, the system further comprises a conference bridge configured to
receive the multimedia information, wherein the conference bridge is configured to provide the
multimedia information to the processor. In some embodiments, the conference bridge is
configured to receive multimedia information from a plurality of sources (e.g., sources located in
different geographical regions). In other embodiments, the conference bridge is further configured
to allow the multimedia information to be viewed (e.g., is configured to allow one or more viewers
to have access to the systems of the present invention).

In some embodiments, the system further comprises a delay component configured to
receive the multimedia information, delay at least a portion of the multimedia information, and
send the delayed portion of the multimedia information to the processor.
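
By way of illustration only, a delay component of the kind described above might be sketched in
Python as a buffer that releases each chunk of multimedia information only after a fixed number of
later chunks has arrived. The class name DelayBuffer, the use of a chunk count as the measure of
delay, and the sample chunk labels are assumptions made for this sketch; the present description
does not prescribe how the delay is implemented.

```python
from collections import deque


class DelayBuffer:
    """Hypothetical delay component: holds back chunks by a fixed chunk count."""

    def __init__(self, delay_chunks):
        if delay_chunks < 0:
            raise ValueError("delay_chunks must be non-negative")
        self.delay_chunks = delay_chunks
        self._pending = deque()

    def push(self, chunk):
        """Accept a new chunk; return the delayed chunk that is now due, or None."""
        self._pending.append(chunk)
        if len(self._pending) > self.delay_chunks:
            return self._pending.popleft()
        return None

    def flush(self):
        """Release everything still held (e.g., at the end of the event)."""
        remaining = list(self._pending)
        self._pending.clear()
        return remaining


def run_delay_demo():
    """Push four labeled chunks through a two-chunk delay and collect the output."""
    buf = DelayBuffer(delay_chunks=2)
    released = [buf.push(chunk) for chunk in ["a0", "a1", "a2", "a3"]]
    return released, buf.flush()
```

In this sketch the first two calls to push return nothing because the buffer is still filling;
thereafter each new chunk releases the oldest held chunk to the processor.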

In some embodiments, the system further comprises a text to speech converter configured 
to convert at least a portion of the text information to audio. 

In still other embodiments, the system further comprises a language translator configured
to receive the text information and convert the text information from a first language into one or
more other languages.

In some embodiments, the processor is further configured to transmit a viewer output
signal comprising the second information stream (e.g., transmit information to one or more
viewers). In some embodiments, the viewer output signal further comprises the first information
stream. In preferred embodiments, the viewer output signal is compatible with a multimedia
software application (e.g., a multimedia software application on a computer of a viewer).

In some embodiments, the system further comprises a software application configured to
display the first and/or the second information streams (e.g., allowing a viewer to listen to audio,
view video, and view text). In some preferred embodiments, the software application is
configured to display the text information in a distinct viewing field. In some embodiments, the
software application comprises a text viewer. In other embodiments, the software application
comprises a multimedia player embedded into a text viewer. In some preferred embodiments, the
software application is configured to allow the text information to be printed.

The present invention further provides a system for interactive electronic communications
comprising a processor, wherein the processor is configured to receive multimedia information,
encode an information stream comprising text information, send the information stream to a
viewer, wherein the text information is synchronized with an audio or video file, and receive
feedback information from the viewer.

The present invention also provides methods of using any of the systems disclosed herein. 
For example, the present invention provides a method for providing streaming text information, 
the method comprising providing a processor and multimedia information comprising audio 
information; and processing the multimedia information with the processor to generate a first 
information stream and a second information stream, said first information stream comprising the 
audio information and said second information stream comprising text information, said text 
information corresponding to the audio information. 

In some embodiments, the method further comprises the step of converting the text
information into audio. In other embodiments, the method further comprises the step of translating
the text information into one or more different languages. In still other embodiments, the method
further comprises the step of transmitting the second information stream to a computer of a viewer.
In other embodiments, the method further comprises the step of receiving feedback information
(e.g., questions or comments) from a viewer.

The present invention further provides systems and methods for providing translations for
motion pictures, television shows, or any other serially encoded medium. For example, the present 
invention provides methods for the translation of audio dialogue into another language that will be 
represented in a form similar to subtitles. The method allows synchronization of the subtitles with 
the original audio. The method also provides a hardcopy or electronic translation of the dialogue 
in a scripted form. The systems and methods of the present invention may be used to transmit and 
receive synchronized audio, video, timecode, and text over a communication network. In some 
embodiments, the information is encrypted and decrypted to protect the material against piracy or
theft. Using the methods of the present invention, a dramatic reduction (e.g., 50% or more) in
the time between a domestic motion picture release and foreign releases is achieved. 

In some such embodiments, the present invention provides methods for providing a motion
picture translation comprising: providing motion picture audio information, a translation system
that generates a text translation of the audio, and a processor that encodes text and audio
information; processing the motion picture audio information with the translation system to
generate a text translation of the audio; processing the text translation with the processor to
generate encoded text information; processing the motion picture audio information with the
processor to generate encoded audio information; and synchronizing the encoded text information
and the encoded audio information. Such methods find use, for example, in reducing the cost and
process delay of motion picture translations by 50% or more (e.g., 50%, 51%, . . ., 90%, . . .).

The present invention also provides a system comprising a processor configured to receive
text information from a speech-to-text converter, receive multimedia information from a
conference bridge, encode text information into an information stream, encode multimedia
information into an information stream, and send information to and receive information from a
language translator. In some embodiments, the processor further comprises a resource manager
configured to allow said processor to continuously process 10 or more (e.g., 11, 12, . . ., 100,
. . ., 1000, . . .) information streams simultaneously.

The present invention further provides systems and methods for two-way real time
conversational language translation. For example, the present invention provides methods
comprising: providing a conference bridge configured to receive a plurality of audio information
inputs, a speech-to-text converter, a text-to-speech converter, and a language translator; inputting
audio from a first user to said conference bridge to provide first audio information; converting the
first audio information into text information using the speech-to-text converter; translating the text
information into a different language using the language translator to generate translated text
information; converting the translated text information into translated audio using the
text-to-speech converter; and providing the translated audio to a second (or other) user(s).
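
The sequence of steps recited above (speech to text, translation, text to speech) can be
illustrated with a short Python sketch. Everything named here is hypothetical: the
TranslationPipeline class, its relay method, and the toy converter functions are stand-ins
introduced only so the example runs without any real speech recognition, translation, or speech
synthesis engine. An actual embodiment would substitute a stenograph feed, voice recognition
software, or another converter for each stage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationPipeline:
    """Hypothetical wiring of the recited steps; every stage is supplied by the caller."""

    speech_to_text: Callable[[bytes], str]
    translate: Callable[[str, str, str], str]    # (text, source_lang, target_lang)
    text_to_speech: Callable[[str, str], bytes]  # (text, lang)

    def relay(self, audio: bytes, source_lang: str, target_lang: str) -> bytes:
        """Carry one utterance from a first user to a second user in another language."""
        text = self.speech_to_text(audio)
        translated = self.translate(text, source_lang, target_lang)
        return self.text_to_speech(translated, target_lang)


# Toy stand-ins (assumptions for illustration): "audio" is just UTF-8 text bytes.
_TOY_DICTIONARY = {
    ("en", "es"): {"hello": "hola"},
    ("es", "en"): {"hola": "hello"},
}


def toy_speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    table = _TOY_DICTIONARY.get((source_lang, target_lang), {})
    return " ".join(table.get(word, word) for word in text.split())


def toy_text_to_speech(text: str, lang: str) -> bytes:
    return text.encode("utf-8")


pipeline = TranslationPipeline(toy_speech_to_text, toy_translate, toy_text_to_speech)
```

Because the translation is two-way, the same pipeline object can be called in either direction,
for example pipeline.relay(audio, "en", "es") for the first user's speech and
pipeline.relay(audio, "es", "en") for the reply.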

The present invention also provides scaled up systems and methods. For example, the
present invention provides a system comprising a speech-to-text converter and a processor, said
processor configured to receive text information from the speech-to-text converter and encode
10,000 or more text information streams (e.g., text information streams that are sent to viewers).
In some embodiments, the system is configured (e.g., using a plurality of processors) to
simultaneously transmit 10,000 or more text information streams (e.g., 100,000 or more to
1,000,000 or more). In some embodiments, the system further comprises a caption server
configured to receive text information from the speech-to-text converter and configured to transmit
text information to the processor. In some embodiments, the caption server is configured to
simultaneously receive text information from 200 or more speech-to-text converters. In some
embodiments, the caption server comprises multiple processors, wherein an unlimited number of
simultaneous text information streams are received from an unlimited number of speech-to-text
converters. In some embodiments, the speech-to-text converter comprises a computer running
captioning software. In some preferred embodiments, the computer comprises a software
application that allows text information to be transmitted over an Internet without the use of a
serial to IP device.

The present invention further provides a system comprising: a conference bridge
configured to receive audio information; a speech-to-text converter configured to receive audio
information from the conference bridge and to convert at least a portion of the audio information
into text information; and a processor configured to receive the text information from the
speech-to-text converter and to encode a text information stream. In some embodiments, one or
more of the transmissions (e.g., receipt of information by the conference bridge, transmission of
information from the conference bridge to the speech-to-text converter, transmission of
information from the speech-to-text converter to the processor, or transmission of text information
streams from the processor) is carried out by a wireless communication system. In some
embodiments, the processor is further configured to transmit the text information stream to a
computer system of a viewer. In some preferred embodiments, the processor is further configured
to transmit a text viewer software application to the viewer. In still further preferred embodiments,
the processor is further configured to receive feedback information from the viewer.

DESCRIPTION OF THE FIGURES 

Figure 1 shows a schematic representation of one embodiment of the systems of the present
invention.

Figure 2 shows a schematic representation of a conference bridge configuration in one
embodiment of the present invention.

Figure 3 shows a schematic representation of a processor configuration in one embodiment
of the present invention.

Figure 4 shows a representation of a media player in one embodiment of the present
invention.

Figure 5 shows a schematic representation of system connectivity in one embodiment of
the present invention.

Figure 6 shows a schematic representation of a talk-show format using the systems and
methods of the present invention.

Figure 7 shows a schematic representation of a corporate meeting using the systems and
methods of the present invention.

Figure 8 shows a schematic representation of the generation of translation and sub-titles for
video using the systems and methods of the present invention.

DEFINITIONS 

To facilitate an understanding of the present invention, a number of terms and phrases are
defined below:

As used herein the terms "processor" and "central processing unit" or "CPU" are used
interchangeably and refer to a device that is able to read a program from a computer memory (e.g.,
ROM or other computer memory) and perform a set of steps according to the program.

As used herein, the terms "computer memory" and "computer memory device" refer to any
storage media readable by a computer processor. Examples of computer memory include, but are
not limited to, RAM, ROM, computer chips, digital video discs (DVDs), compact discs (CDs), hard
disk drives (HDD), and magnetic tape.

As used herein, the term "computer readable medium" refers to any device or system for
storing and providing information (e.g., data and instructions) to a computer processor. Examples
of computer readable media include, but are not limited to, DVDs, CDs, hard disk drives, magnetic
tape and servers for streaming media over networks.

As used herein the terms "multimedia information" and "media information" are used
interchangeably to refer to information (e.g., digitized and analog information) encoding or
representing audio, video, and/or text. Multimedia information may further carry information not
corresponding to audio or video. Multimedia information may be transmitted from one location or
device to a second location or device by methods including, but not limited to, electrical, optical,
and satellite transmission, and the like.

As used herein the term "audio information" refers to information (e.g., digitized and
analog information) encoding or representing audio. For example, audio information may
comprise encoded spoken language with or without additional audio. Audio information includes,
but is not limited to, audio captured by a microphone and synthesized audio (e.g., computer
generated digital audio).

As used herein the term "video information" refers to information (e.g., digitized and
analog information) encoding or representing video. Video information includes, but is not limited
to, video captured by a video camera, images captured by a camera, and synthetic video (e.g.,
computer generated digital video).

As used herein the term "text information" refers to information (e.g., analog or digital
information) encoding or representing written language or other material capable of being
represented in text format (e.g., corresponding to spoken audio). For example, computer code
(e.g., in .doc, .ppt, or any other suitable format) encoding a textual transcript of a spoken audio
performance comprises text information. In addition to written language, text information may
also encode graphical information (e.g., figures, graphs, diagrams, shapes) related to, or
representing, spoken audio. "Text information corresponding to audio information" comprises text
information (e.g., a text transcript) substantially representative of a spoken audio performance.
For example, a text transcript containing all or most of the words of a speech comprises "text
information corresponding to audio information."

As used herein the term "configured to receive multimedia information" refers to a device
that is capable of receiving multimedia information. Such devices contain one or more
components that can receive a signal carrying multimedia information. In preferred embodiments,
the receiving component is configured to transmit the multimedia information to a processor.

As used herein the term "encode" refers to the process of converting one type of
information or signal into a different type of information or signal to, for example, facilitate the
transmission and/or interpretability of the information or signal. For example, audio sound waves
can be converted into (i.e., encoded into) electrical or digital information. Likewise, light patterns
can be converted into electrical or digital information that provides an encoded video capture of
the light patterns. As used herein, the term "separately encode" refers to two distinct encoded
signals, whereby a first encoded set of information contains a different type of content than a
second encoded set of information. For example, multimedia information containing audio and
video information is separately encoded where video information is encoded into one set of
information while the audio information is encoded into a second set of information. Likewise,
multimedia information is separately encoded where audio information is encoded and processed
in a first set of information and text corresponding to the audio information is encoded and/or
processed in a second set of information.
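
As a non-limiting illustration of separate encoding, the following Python sketch produces two
distinct encoded sets of information from one event: one carrying audio and one carrying the
corresponding text. The function names separately_encode and ts_ms, the 4-byte millisecond
timestamp header on audio chunks, and the JSON layout of the text records are all assumptions
chosen for this sketch, not formats taken from the present description.

```python
import json


def ts_ms(seconds):
    """Convert a timestamp in seconds to whole milliseconds."""
    return int(round(seconds * 1000))


def separately_encode(audio_chunks, captions):
    """Return (audio_stream, text_stream) as two independent lists of bytes.

    audio_chunks: list of (timestamp_seconds, raw_audio_bytes)
    captions:     list of (timestamp_seconds, caption_text)
    """
    # First encoded set: each audio chunk is prefixed with a 4-byte big-endian timestamp.
    audio_stream = [
        ts_ms(ts).to_bytes(4, "big") + payload for ts, payload in audio_chunks
    ]
    # Second encoded set: each caption becomes its own UTF-8 JSON record.
    text_stream = [
        json.dumps({"t_ms": ts_ms(ts), "text": text}).encode("utf-8")
        for ts, text in captions
    ]
    return audio_stream, text_stream
```

Keeping a timestamp in both sets is one possible way for a viewer application to realign the text
with the audio after the two sets have travelled separately.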

As used herein the term "information stream" refers to a linearized representation of
multimedia information (e.g., audio information, video information, text information). Such
information can be transmitted in portions over time (e.g., file processing that does not require
moving the entire file at once, but processing the file during transmission (the stream)). For
example, streaming audio or video information utilizes an information stream. As used herein, the
term "streaming" refers to the network delivery of media. "True streaming" matches the
bandwidth of the media signal to the viewer's connection, so that the media is seen in real time. As
is known in the art, specialized media servers and streaming protocols are used for true streaming.
Real Time Streaming Protocol (RTSP, REALNETWORKS) is a standard used to transmit true
streaming media to one or more viewers simultaneously. RTSP provides for viewers randomly
accessing the stream, and uses Real-time Transport Protocol (RTP) as the
transfer protocol. RTP can be used to deliver live media to one or more viewers simultaneously.
"HTTP streaming" or "progressive download" refers to media that may be viewed over a network
prior to being fully downloaded. Examples of software for "streaming" media include, but are not
limited to, QUICKTIME, NETSHOW, WINDOWS MEDIA, REALVIDEO, REALSYSTEM G2,
and REALSYSTEM 8. A system for processing, receiving, and sending streaming information
may be referred to as a "stream encoder" and/or an "information streamer."

As used herein, the term "digitized video" refers to video that is either converted to digital
format from analog format or recorded in digital format. Digitized video can be uncompressed or
compressed into any suitable format including, but not limited to, MPEG-1, MPEG-2, DV,
M-JPEG or MOV. Furthermore, digitized video can be delivered by a variety of methods, including
playback from DVD, broadcast digital TV, and streaming over the Internet. As used herein, the
term "video display" refers to a video that is actively running, streaming, or playing back on a
display device.

As used herein, the term "codec" refers to a device, either software or hardware, that
translates video or audio between its uncompressed form and the compressed form (e.g., MPEG-2)
in which it is stored. Examples of codecs include, but are not limited to, CINEPAK, SORENSON
VIDEO, INDEO, and HEURIS codecs. "Symmetric codecs" encode and decode video in
approximately the same amount of time. Live broadcast and teleconferencing systems generally
use symmetric codecs in order to encode video in real time as it is captured.

As used herein, the term "compression format" refers to the format in which a video or 
audio file is compressed. Examples of compression formats include, but are not limited to, 
MPEG-1, MPEG-2, MPEG-4, M-JPEG, DV, and MOV. 

As used herein, the term "client-server" refers to a model of interaction in a distributed
system in which a program at one site sends a request to a program at another site and waits for a
response. The requesting program is called the "client," and the program that responds to the
request is called the "server." In the context of the World Wide Web (discussed below), the client
is a "Web browser" (or simply "browser") that runs on a computer of a user; the program which
responds to browser requests by serving Web pages is commonly referred to as a "Web server."

As used herein, the term "hyperlink" refers to a navigational link from one document to 
another, or from one portion (or component) of a document to another. Typically, a hyperlink is 
displayed as a highlighted word or phrase that can be selected by clicking on it using a mouse to
jump to the associated document or documented portion. 

As used herein, the term "hypertext system" refers to a computer-based informational 
system in which documents (and possibly other types of data entities) are linked together via 
hyperlinks to form a user-navigable "web." 

As used herein, the term "Internet" refers to any collection of networks using standard 
protocols. For example, the term includes a collection of interconnected (public and/or private)
networks that are linked together by a set of standard protocols (such as TCP/IP, HTTP, and FTP) 
to form a global, distributed network. While this term is intended to refer to what is now 
commonly known as the Internet, it is also intended to encompass variations that may be made in 
the future, including changes and additions to existing standard protocols or integration with other 
media (e.g., television, radio, etc.). The term is also intended to encompass non-public networks
such as private (e.g., corporate) Intranets. 

As used herein, the terms "World Wide Web" or "web" refer generally to both (i) a 
distributed collection of interlinked, user-viewable hypertext documents (commonly referred to as 
Web documents or Web pages) that are accessible via the Internet, and (ii) the client and server 
software components which provide user access to such documents using standardized Internet 
protocols. Currently, the primary standard protocol for allowing applications to locate and acquire
Web documents is HTTP, and the Web pages are encoded using HTML. However, the terms 
"Web" and "World Wide Web" are intended to encompass future markup languages and transport 
protocols that may be used in place of (or in addition to) HTML and HTTP. 

As used herein, the term "web site" refers to a computer system that serves informational 
content over a network using the standard protocols of the World Wide Web. Typically, a Web 
site corresponds to a particular Internet domain name and includes the content associated with a
particular organization. As used herein, the term is generally intended to encompass both (i) the 
hardware/software server components that serve the informational content over the network, and 
(ii) the "back end" hardware/software components, including any non-standard or specialized 
components, that interact with the server components to perform services for Web site users.

As used herein, the term "HTML" refers to HyperText Markup Language, which is a standard
coding convention and set of codes for attaching presentation and linking attributes to
informational content within documents. During a document authoring stage, the HTML codes
(referred to as "tags") are embedded within the informational content of the document. When the
Web document (or HTML document) is subsequently transferred from a Web server to a browser,
the codes are interpreted by the browser and used to parse and display the document. Additionally,
in specifying how the Web browser is to display the document, HTML tags can be used to create
links to other Web documents (commonly referred to as "hyperlinks").

As used herein, the term "HTTP" refers to HyperText Transfer Protocol, which is the
standard World Wide Web client-server protocol used for the exchange of information (such as
HTML documents, and client requests for such documents) between a browser and a Web server.
HTTP includes a number of different types of messages that can be sent from the client to the
server to request different types of server actions. For example, a "GET" message, which has the
format GET <URL>, causes the server to return the document or file located at the specified URL.

As used herein, the term "URL" refers to Uniform Resource Locator, which is a unique
address that fully specifies the location of a file or other resource on the Internet. The general
format of a URL is protocol://machine address:port/path/filename. The port specification is
optional, and if none is entered by the user, the browser defaults to the standard port for whatever
service is specified as the protocol. For example, if HTTP is specified as the protocol, the browser
will use the HTTP default port of 80.
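
By way of illustration only, the default-port behavior described above can be sketched in Python using the standard urllib.parse module. The table of default ports and the helper name effective_port are assumptions introduced for this sketch and are not part of the disclosed systems.

```python
from urllib.parse import urlsplit

# Illustrative table of well-known default ports (an assumption for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def effective_port(url: str) -> int:
    """Return the explicit port in `url`, or the default port for its protocol."""
    parts = urlsplit(url)
    if parts.port is not None:
        # The user entered a port, so it takes precedence over the default.
        return parts.port
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"no default port known for protocol {parts.scheme!r}")
    return DEFAULT_PORTS[parts.scheme]
```

For a URL with no port specification and HTTP as the protocol, the helper falls back to port 80; an explicit port in the URL is returned unchanged.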

As used herein, the term "PUSH technology" refers to an information dissemination 
technology used to send data to users over a network. In contrast to the World Wide Web (a "pull" 
technology), in which the client browser must request a Web page before it is sent, PUSH
protocols send the informational content to the user computer automatically, typically based on
information pre-specified by the user.

As used herein the terms "live event" and "live media event" are used interchangeably to
refer to an event that is to be captured in the form of audio, video, text, or multimedia information,
wherein the captured information is used to transmit a representation of the event (e.g., a video,
audio, or text capture of the event) to one or more viewers in real time or substantially real time
(i.e., it will be appreciated that delays on the order of seconds to minutes may be incurred in the
capture, delivery, and/or processing of information prior to its display to viewers while still
considering the display of the event as a "live" event). As used herein, "live event audio" refers to
audio from a live event that is captured as audio information and transmitted, in some form, to a
viewer in real time. As used herein, "live educational event" refers to a live event featuring an
educational component directed at the viewer.

As used herein the term "televised event" refers to an event that is televised or is intended 
to be televised. "Televised audio" refers to the audio portion of a televised event, including, for
example, spoken language and sounds, as well as music and sound effects. Television audio may
be converted to information (e.g., multimedia or audio information).

As used herein the term "motion picture event" refers to an event that is incorporated into a
motion picture or is intended to be incorporated into a motion picture. Motion picture events
include material already captured in the form of video or film, as well as live events that are to be
captured on video or film. "Motion picture audio" refers to the audio portion of a motion picture
event, including, for example, the audio content of a soundtrack and voiceover in a completed
motion picture.

As used herein the term "event audio" refers to the audio component of an event. Events 
include any live performance, prerecorded performance, and artificially synthesized performance
of any kind (e.g., any event or material that contains speech).

As used herein the term "distinct locations" refers to two or more different physical 
locations where viewers can separately view a multimedia presentation. For example, a person 
viewing a presentation in one location (e.g., on a video monitor) would be in a distinct location
from a second person viewing the same presentation (e.g., on a different video monitor) if the first
and second persons are located in different rooms, cities, countries, and the like. 

As used herein the term "speech to text converter" refers to any system capable of 
converting audio into a text representation or copy of the audio. For example, a stenographer 
listening to spoken language from an audio source and converting the spoken language to text
using a stenograph comprises a speech to text converter. Likewise, a speech-to-text software
application and the appropriate hardware to run it would be considered a speech to text converter
(See e.g., U.S. Patent Nos. 5,926,787, 5,950,194, and 5,740,245, herein incorporated by reference
in their entireties). A system that is "configured to produce text from multimedia information"
contains a component that receives multimedia information and a component that provides speech
to text conversion.

As used herein the term "text to speech converter" refers to any system capable of 
converting text or text information into spoken audio. For example, a text-to-speech software
application and the appropriate hardware to run it would be considered a text to speech converter.
In some embodiments of the present invention, a single system may have text to speech and speech
to text conversion capabilities. A system that is capable of processing "at least a portion of text
information" is a system that can recognize all, or a portion of a text document or text information,
and process the text or information (e.g., convert the text to audio).

As used herein the term "error corrector" refers to a system that contains a component 
capable of reviewing text converted from audio to confirm the accuracy of the conversion. If
errors were made in the audio to text conversion, the error corrector identifies and corrects the
errors. For example, a human reviewer of a previously computer generated speech to text
transcript comprises an error corrector. A system that is "configured to confirm text accuracy" is a
system that contains the appropriate components to allow an error corrector to review a speech to
text translation. For example, where the correction is being conducted by a human reviewer, the
system may comprise a display system for displaying the original conversion to the reviewer, an
audio playback system for the reviewer to listen to the original audio, and a data input system for
the reviewer to correct errors in the original conversion.

As used herein the term "security protocol" refers to an electronic security system (e.g., 
hardware and/or software) to limit access to a processor to specific users authorized to access the
processor. For example, a security protocol may comprise a software program that locks out one
or more functions of a processor until an appropriate password is entered.

As used herein the term "conference bridge" refers to a system for receiving and relaying
multimedia information to and from a plurality of locations. For example, a conference bridge can
receive signals from one or more live events (e.g., in the form of audio, video, multimedia, or text
information), transfer information to a processor or a speech-to-text conversion system, and send
processed and/or unprocessed information to one or more viewers connected to the conference
bridge. The conference bridge can also, as desired, be accessed by system administrators or any
other desired parties.

As used herein the term "viewer" refers to a person who views text, audio, video, or 
multimedia content. Such content includes processed content such as information that has been 
processed and/or translated using the systems and methods of the present invention. As used 
herein, the phrase "view multimedia information" refers to the viewing of multimedia information
by a viewer. "Feedback information from a viewer" refers to any information sent from a viewer to 
the systems of the present invention in response to text, audio, video, or multimedia content. 
As used herein the term "resource manager" refers to a system that optimizes the
performance of a processor or another system. For example, a resource manager may be
configured to monitor the performance of a processor or software application and manage data and
processor allocation, perform component failure recoveries, optimize the receipt and transmission
of data (e.g., streaming information), and the like. In some embodiments, the resource manager
comprises a software program provided on a computer system of the present invention.

As used herein the term "delay component" refers to a device or program that delays one or 
more components of transmitted multimedia information. Delay components find use, for 
example, in delaying one portion of a multimedia signal to allow a separate portion (e.g., a
separately processed portion) to be realigned with the first portion prior to displaying the
multimedia content to a viewer. For example, an audio portion of multimedia information may be
converted to text and one or more of the information components is delayed such that a viewer of
the multimedia content is presented with a real time performance of the audio, video, and text.
The phrase "delay at least a portion of multimedia information" refers to delaying at least one
component of multimedia information, while optionally delaying or not delaying other components
(e.g., delaying audio information, while delaying or not delaying corresponding video 
information). 
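
By way of illustration only, a software delay component of the kind described above might be sketched as a fixed-length first-in, first-out buffer. The class name DelayLine and the use of a frame count as the unit of delay are assumptions made for this sketch; a hardware delay device would serve the same purpose.

```python
from collections import deque
from typing import Deque, Optional


class DelayLine:
    """Holds frames back by a fixed number of steps before releasing them."""

    def __init__(self, delay_steps: int) -> None:
        if delay_steps < 0:
            raise ValueError("delay_steps must be non-negative")
        self._delay_steps = delay_steps
        self._buffer: Deque[bytes] = deque()

    def push(self, frame: bytes) -> Optional[bytes]:
        """Accept a new frame and return the frame that is now due, if any."""
        self._buffer.append(frame)
        if len(self._buffer) > self._delay_steps:
            # The oldest frame has waited long enough and can be released.
            return self._buffer.popleft()
        return None
```

In such a sketch, audio frames would pass through the delay line while the separately processed text is generated, so that both reach the viewer in alignment.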

As used herein the term "language translator" refers to systems capable of converting audio 
or text from one language into another language. For example, a language translator may 
comprise translation software (e.g., software that is capable of converting text in one language to
text in another language). Language translators may further comprise an error correction system.

As used herein the term "viewer output signal" refers to a signal that contains multimedia 
information, audio information, video information, and/or text information that is delivered to a
viewer for viewing the corresponding multimedia, audio, video, and/or text content. For example,
a viewer output signal may comprise a signal that is receivable by a video monitor, such that the
signal is presented to a viewer as text, audio, and/or video content. 

As used herein, the term "compatible with a software application" refers to signals or
information configured in a manner that is readable by a software application, such that the
software application can convert the signal or information into multimedia content displayable to a
viewer.

As used herein the term "distinct viewing field" refers to a viewer display comprising two 
or more display fields, such that each display field can contain different content from one another. 
For example, a display with a first region displaying video and a second region displaying text
(e.g., a text box) comprises distinct viewing fields. The distinct viewing fields need not be
viewable at the same time. For example, viewing fields may be layered such that only one or a
subset of the viewing fields is displayed. The undisplayed viewing fields can be switched to
displayed viewing fields at the direction of the viewer.

As used herein the term "in electronic communication" refers to electrical devices (e.g., 
computers, processors, conference bridges, communications equipment) that are configured to 
communicate with one another through direct or indirect signaling. For example, a conference
bridge and a processor that are connected through a cable or wire, such that information can pass
between the conference bridge and the processor, are in electronic communication with one
another. Likewise, a computer configured to transmit (e.g., through cables, wires, infrared signals,
telephone lines, etc.) information to another computer or device is in electronic communication
with the other computer or device.

As used herein the term "transmitting" refers to the movement of information (e.g., data)
from one location to another (e.g., from one device to another) using any suitable means.

As used herein, the term "administrator" refers to a user of the systems of the present
invention who is capable of approving customer registrations and event requests and/or a user with 
privileges to reconfigure the main content. 

As used herein, the term "captionist" refers to a user of the systems of the present invention
who transforms audio into captions and/or transcripts, typically using a stenograph-like device and
appropriate software.

As used herein, the term "customer" refers to a user (e.g., a viewer) of the systems of the 
present invention that can view events and request services for events and/or pay for such services. 

As used herein, the term "player" (e.g., a multimedia player) refers to a device or software
capable of transforming information (e.g., multimedia, audio, video, and text information) into
content displayable to a viewer (e.g., audible, visible, and readable content).

DETAILED DESCRIPTION OF THE INVENTION 

The present invention comprises systems and methods for providing text transcripts of 
multimedia events. For example, text transcripts of live or pre-recorded audio events are generated
by the systems and methods of the present invention. The audio may be a component of a more
complex multimedia performance, such as televised or motion picture video. Text transcripts are
made available to viewers either as pure text transcripts or in conjunction with audio or video (e.g.,
audio or video from which the text was derived). In some preferred embodiments of the present
invention (e.g., for live events), text is encoded in an information stream and streamed to a viewer
along with the audio or video event. In some such embodiments, the text is configured to be
viewable separate from the media display on a viewer's computer. In yet other preferred
embodiments, the text is provided to the viewer in a manner that allows the viewer to manipulate
the text. Such manipulations include copying portions of the text into a separate file location,
printing the text, and the like.

The systems and methods of the present invention also allow audio to be translated into one 
or more different languages prior to delivery to a viewer. For example, in some embodiments,
audio is converted to text and the text is translated into one or more desired languages. The
translated text is then delivered to the viewer along with the original audio-containing content. In
some embodiments, the text is re-converted to audio (e.g., translated audio) and the audio is
streamed to the viewer, with or without the text transcript.

The systems and methods of the present invention find use in numerous applications,
including, but not limited to, the generation of text from live events (e.g., speeches), televised
events, motion pictures, live education events, legal proceedings, text for hearing impaired
individuals, or any other application where a speech-to-text or audio-to-text conversion is desired.

Certain preferred embodiments of the present invention are described in detail below.
These illustrative examples are not intended to limit the scope of the invention. The description is
provided in the following sections: I) Information Processing Systems and II) Applications.

I) Information Processing Systems 

The present invention provides systems for processing media events to generate text from 
an audio component of a media event and to process, as desired, and deliver the text to a viewer.
One preferred embodiment of the systems of the present invention is diagrammed in Figure 1.
Figure 1 shows a number of components, including optional components, of the systems of the
present invention. In this embodiment, the audio information of a media event is transferred to a
conference bridge. Audio information received by the conference bridge is then sent to one or
more other components of the system. For example, audio information may be sent to a
speech-to-text converter (e.g., a captionist/transcriptionist and/or voice recognition software) where
the audio is converted to text. The media information received by the conference bridge may also be
sent directly to a processor that encodes the audio for delivery to a viewer (e.g., compresses the audio
and/or video components of multimedia information into streaming data for delivery to a viewer
over a public or private electronic communication network). Text information that is generated by
the speech-to-text converter is also sent to the processor for delivery to a viewer. In preferred
embodiments, the text information is encoded in a delivery stream separate from the audio or video
components of the multimedia information that is sent to a viewer. The text information, as
desired, can be translated into one or more different languages. For example, in Figure 1, the
encoded text stream is translated using a real-time language translator (e.g., SysTran, Enterprise).

Processed multimedia information and text streams may be delivered directly to one or
more viewers or the multimedia information may be delivered through an intermediary (e.g.,
through one or more electronic network service components including, but not limited to, web
servers, databases, and information streamers). In some embodiments, the multimedia and text
information is configured to be readable by a media player of a viewer. In some embodiments, the
text information is configured to be readable by a separate text viewer application. The separate
text box may be provided as a computer program distinct from the media player or may be
integrated with a media player. In some such embodiments, a player application is delivered to, or
accessed by, the viewer. The text received by the viewer can further be re-converted to audio. For
example, streaming audio generated from text by a processor of the present invention may be sent
to a viewer with or without the corresponding text. This has particular application where the text
has been translated into a language of the viewer (e.g., where the language of the viewer is
different from the language of the original audio event). In some preferred embodiments, the
system of the present invention is configured to receive feedback from the viewer (in the form of
comments or questions). The feedback can occur through any suitable means, including, but not
limited to, web based email, a question queue integrated with the media player or text display
application, and direct call-in through the conference bridge (e.g., using either voice-over-IP or
public switched network). The question queue can be run through the language translator in both
directions (e.g., questions from the viewer to a screener or moderator, and all approved questions
refreshed back to all viewers are translated to the language of each participant exposed to the
material).

In some preferred embodiments, one or more (or all) of the components of the invention
are automated. For example, in some embodiments, participants in the event to be transmitted
(e.g., a live event) and viewers simply access the systems of the present invention through a
web-based interface. No additional human interaction is necessary to manage the processor or
information processing components of the present invention. Once accessed, the event can
proceed, with streaming text information from the event being sent to the viewer, and optionally,
with feedback (e.g., questions/comments) from viewers being made available to participants and
other viewers in any desired format and in any number of languages.

A. Media Events 

The present invention finds use with a wide variety of media events, including live and 
non-live events (e.g., transcription/translation from pre-recorded media). Any event that contains
an audio component that can be converted to text finds use with the systems and methods of the
present invention. Such events include, but are not limited to, live speeches (e.g., political
speeches), news events, educational events (e.g., educational events for distance learning), live or
pre-recorded video (e.g., television, motion pictures, etc.), artistic performances, radio
performances, legal proceedings, talk-shows, and the like. The present invention may be used for
interactive events, wherein information is continuously received, processed, and delivered to
participants and viewers.

B. Conference Bridge

In some embodiments of the present invention, a conference bridge is employed to manage
incoming content, including multimedia information (e.g., audio information) as well as viewer
feedback (e.g., in the form of live call-in comments and questions, and the like). The conference
bridge can be configured to deliver incoming information to other components of the system,
including speech-to-text converters and processors. In some embodiments of the present
invention, only the audio information component of the multimedia information generated by an
event is processed through the conference bridge. In other embodiments, video or other
multimedia components are also processed through the conference bridge. The conference bridge
may contain one or more devices that allow information from different sources to be received
simultaneously or at different times. For example, the conference bridge can be configured to
receive digital or analog audio information from sources including, but not limited to, telephone
lines, cable connections, satellite transmissions, direct connections to microphones, and the like.

An example of a conference bridge that finds use in an interactive talk-show format is 
diagrammed in Figure 2. In this example, multimedia information generated at a live event is 
transmitted to the conference bridge. The multimedia information includes audio from a
moderator and participants of the live event. Audio information can also be received from one or
more remote recipients. Viewers (e.g., call-in viewers) of the talk-show can also send audio
information to the conference bridge. As desired, the information content from the call-in viewers
can be screened to determine if it is appropriate to disseminate to other viewers or participants. In
such embodiments, a call-in screener is connected to the conference bridge such that the call-in
screener monitors the call-in audio from the viewers prior to it being heard or viewed by other
viewers or participants. The conference bridge can be configured to allow different levels of
access and information processing. For example, the event participant audio information can
automatically be processed to text, while the call-in viewer audio is originally directed to a private
call-in virtual conference, monitored, and only sent to the live virtual conference for text
conversion if approved by the screener. Information that is to be converted to text is sent to a
speech-to-text converter. The speech-to-text converter need not receive the video of the live event,
but can simply be sent the audio (e.g., through the conference bridge) that is to be converted to
text. Additional participants may also be connected to the conference bridge including a system
administrator or operator. The control of the conference bridge can be operated directly or over a
communications network. For example, all of the moderator, participant, and administrator
functions can be controlled over the World Wide Web.
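
By way of illustration only, the two levels of access described above (event participant audio processed automatically, call-in viewer audio held until approved by a screener) can be modeled with a short Python sketch. The class ConferenceBridge, its method names, and the use of strings as stand-ins for audio are assumptions made for this sketch and do not describe any particular conference bridge product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConferenceBridge:
    """Toy model of the live and private virtual conferences."""

    # Audio in the live conference is forwarded for text conversion.
    live_conference: List[str] = field(default_factory=list)
    # Audio in the private call-in conference awaits screening.
    private_conference: List[str] = field(default_factory=list)

    def receive(self, source: str, audio: str) -> None:
        """Route incoming audio according to the access level of its source."""
        if source == "participant":
            self.live_conference.append(audio)
        elif source == "call_in":
            self.private_conference.append(audio)
        else:
            raise ValueError(f"unknown source: {source!r}")

    def screen(self, audio: str, approved: bool) -> None:
        """Release a held call-in item to the live conference, or discard it."""
        self.private_conference.remove(audio)
        if approved:
            self.live_conference.append(audio)
```

In this sketch, only items that the screener approves ever reach the live conference and, from there, the speech-to-text converter.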

The conference bridge is connected to a processor or processors that encode the audio 

information for delivery to one or more viewers, and broadcast the streaming text from the same
processor(s) (server) or from a dedicated server. Multimedia information received by the
conference bridge is sent to the processor through any appropriate connection (direct or indirect,
e.g., Intranet). For example, information may be transmitted or sent through a direct connection
(e.g., through a cable connected to a T-1 of the conference bridge, through an intermediate Lucent
PBX to convert it back to analog, and then to a sound card input of a computer containing the
processor). In some embodiments, text information is sent directly from the speech-to-text
converter to the processor. In embodiments where the text information and multimedia
information (e.g., audio information) are to be simultaneously sent to a viewer, the multimedia
information may need to be delayed in order to align the text to the multimedia information. This
can be accomplished, for example, through the use of a delay component (e.g., an audio delay
device, e.g., Prime Image Pick-2) during the transmission of the multimedia information from the
conference bridge to the processor. The audio information may also be boosted using an amplifier
(e.g., to provide a strong signal or to normalize audio levels from different sources, e.g., ATI
MM-100 amplifier).

In preferred embodiments (e.g., for high usage and automated systems), the conference 
bridge should be able to automatically answer dial-in phone calls. During the development of the 
present invention, it was determined that the analog inputs of Lucent Legend systems were not
suitable for automatic answering. To allow automated answering, an Innkeeper 1 system (Digital
Hybrid, JK Audio) was utilized. This system provides the further advantage of built-in
audio amplification. 

In some embodiments of the present invention wireless systems are used to receive 
information from events and/or to transmit information between components of the systems of the
present invention or between the systems of the present invention and viewers (See e.g., U.S.
Patent Nos. 6,097,733, 6,108,560, and 6,026,082, herein incorporated by reference in their
entireties). In some preferred embodiments, streaming text generated by the systems and methods
of the present invention is sent to viewers using a wireless communication system. To facilitate
transfer, the present invention provides JAVA encoded streaming text information. JAVA
encoding creates small files that permit real-time transmission of text information of the present
invention. In some embodiments, streaming text is sent to handheld wireless devices of a user.
Such methods find particular use with hearing impaired individuals, who can receive real-time 
streaming text of any live audio event on a wireless handheld device. 


C. Speech-to-Text Converter 

Speech to text conversion is accomplished using any suitable system. For example, in 
some embodiments of the present invention, speech-to-text conversion is carried out using a 
human captionist/transcriptionist. In such embodiments, the captionist listens to audio and
encodes a text transcript of the audio (e.g., using a stenograph machine and stenographic software).
The captionist need not be located at the site of the event or at the location of the conference
bridge or processor. For example, in some embodiments, audio information is transmitted to the
captionist and text information recorded by the captionist is transmitted to the processor (e.g., over
an electronic communication network).

Speech to text conversion can also be carried out using voice recognition hardware and/or
software. Audio information can be sent directly to the voice recognition system or can be
pre-processed. Pre-processing may be desired to, for example, remove or reduce unwanted non-speech
audio information or modify the audio information to maximize the performance of the voice
recognition system.

In some embodiments, an error corrector is used to improve the accuracy of the speech to 
text conversion. Error correction can occur, for example, through the use of human and/or
software transcription. For example, in some embodiments, text generated using voice recognition
software is monitored by a human. Errors are identified and/or corrected. Where text is being
streamed in real time or near real time, subsections of the text are reviewed for errors and
corrected, allowing accurate text to be passed to the viewer in the minimum amount of time. In
some embodiments of the present invention, uncorrected text is sent in real-time to the viewer,
while a corrected, more accurate version is made available at a later time (e.g., later during the
event, immediately following the event, or after the event).

In some embodiments, once the corrected copy of the transcript is complete, language 
translations are re-applied and one or more language versions are made available to the customer 
(e.g., via email or secured web site). Text information generated by the speech-to-text converter
and/or language translator is sent to a processor for further processing and delivery to one or more
viewers.

In some embodiments, the present invention provides efficient systems for transferring 
information from a captionist to the processors of the present invention. Traditionally, human 
captionists prepare text transcripts using a software application on a personal computer. Using
standard applications, there is limited flexibility in transferring the text from the personal computer
of the captionist to other locations (e.g., to other computers). Presently, software programs such as
CASEVIEW (Stenograph Corporation, Mount Prospect, IL) allow users access to the text
information if a direct hardware link to the computer of the captionist is made. The present
invention provides hardware and/or software solutions to facilitate flow of text information from
captionists to the processors of the present invention (e.g., to a text streaming server of the present
invention). In some such embodiments, a serial to IP device (e.g., products available from Precidia
Corporation, Canada, including but not limited to the Cypher ASIC) is attached to a serial port of
the captionist computer to allow direct access to the Internet or other TCP/IP electronic
environment. In preferred embodiments, a software application is employed to carry out the
function of the serial to IP device, without the need for hardware. In particularly preferred
embodiments, the software application is linked to the operating system of the captionist computer,
allowing any type of captioning software to be used. In some preferred embodiments, the output
signal is in ASCII format. In some embodiments, SERIAL-IP software (Tactical Systems) is used
in place of a serial to IP device.
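
By way of illustration only, the function of a software serial to IP bridge can be sketched in Python as a loop that copies bytes from a serial-like byte stream to a TCP socket. The function name forward_stream and the chunk size are assumptions made for this sketch; it does not describe how SERIAL-IP or any other named product is implemented, and a real deployment would substitute a handle to the captionist computer's serial port for the generic byte stream used here.

```python
import socket
from typing import BinaryIO


def forward_stream(source: BinaryIO, sock: socket.socket, chunk_size: int = 256) -> int:
    """Copy bytes from `source` to `sock` until `source` is exhausted.

    Returns the total number of bytes forwarded.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # An empty read means the source has no more data to send.
            break
        # sendall() retries until the whole chunk has been handed to the network.
        sock.sendall(chunk)
        total += len(chunk)
    return total
```

Because the bytes are forwarded unchanged, a sketch of this kind is independent of the captioning software that produced them.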

The present invention also provides systems and methods for recognizing any format of 
data obtained from captionists. Captionists use different software protocols for encoding text (e.g.,
CASEVIEW, SMARTENCODER, TAG, SMULVIEW). The streaming text encoded by the
present invention should be of a single format. Therefore, the present invention provides parsing
routines to identify and/or convert text obtained from different captioning protocols into a single
standard format. In some embodiments, captionists that use the systems and methods of the
present invention are registered, such that the protocol used by the captionist is known and the
correct parsing routine is used with information obtained from the captionist. In some
embodiments, a database is maintained that correlates particular captionists to particular parsing
routines. In other embodiments, there is no preexisting knowledge of the captionist's protocol. In
such embodiments, information obtained from the captionist is monitored to identify one or more
unique characteristics of a particular protocol. Once identified, the information from the captionist
is routed (e.g., automatically) through the appropriate parsing routine. The parsing routines of the
present invention comprise software applications that receive captionist information and, based on
known information encoding characteristics of the captionist information (e.g., available from the
manufacturers of the captioning software), alter the information into a format suitable for use with
the present invention (e.g., create grammatically correct and coherent text information for use in
the generation of streaming text files).
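
By way of illustration only, the selection of a parsing routine (by registry lookup for registered captionists, otherwise by monitoring the data for a distinguishing characteristic) can be sketched in Python. The two input formats shown, the captionist identifier, and all function names are hypothetical and invented for this sketch; they do not represent the actual encodings of CASEVIEW or any other captioning protocol.

```python
from typing import Callable, Dict, Optional

Parser = Callable[[str], str]


def parse_format_a(raw: str) -> str:
    # Hypothetical format A: words separated by '|' characters.
    return " ".join(part for part in raw.split("|") if part)


def parse_format_b(raw: str) -> str:
    # Hypothetical format B: upper-case payload following a '##' prefix.
    return raw[2:].strip().lower()


# Registry correlating registered captionists with parsing routines.
CAPTIONIST_PARSERS: Dict[str, Parser] = {"captionist-17": parse_format_a}


def detect_parser(raw: str) -> Optional[Parser]:
    """Guess the protocol from a distinguishing characteristic of the data."""
    if raw.startswith("##"):
        return parse_format_b
    if "|" in raw:
        return parse_format_a
    return None


def normalize(captionist_id: str, raw: str) -> str:
    """Convert captionist output into the single standard text format."""
    parser = CAPTIONIST_PARSERS.get(captionist_id) or detect_parser(raw)
    if parser is None:
        raise ValueError("unrecognized captioning protocol")
    return parser(raw)
```

A registered captionist is routed by lookup alone; detection is attempted only when no registry entry exists.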

; 
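By way of illustration only, the routing of captionist information through a parsing routine may be sketched as follows. The protocol names, marker strings, captionist identifier, and parsing routines shown are hypothetical placeholders and do not describe the actual encoding of any commercial captioning software; the sketch only shows a registry lookup followed by a fallback to characteristic-based detection.

```python
from typing import Callable, Dict, Optional

# Hypothetical parsing routines: each converts one protocol's output to plain text.
def parse_proto_a(raw: str) -> str:
    return raw.replace("\x01", "").strip()

def parse_proto_b(raw: str) -> str:
    return " ".join(raw.replace("<NL>", " ").split())

PARSERS: Dict[str, Callable[[str], str]] = {
    "proto_a": parse_proto_a,
    "proto_b": parse_proto_b,
}

# Stand-in for the database correlating registered captionists to protocols.
REGISTERED: Dict[str, str] = {"captionist-17": "proto_a"}

# Hypothetical "unique characteristics" used when the protocol is not known.
SIGNATURES: Dict[str, str] = {"proto_a": "\x01", "proto_b": "<NL>"}

def detect_protocol(raw: str) -> Optional[str]:
    """Return the first protocol whose marker appears in the incoming text."""
    for name, marker in SIGNATURES.items():
        if marker in raw:
            return name
    return None

def normalize(captionist_id: str, raw: str) -> str:
    """Route captionist text through the appropriate parsing routine."""
    protocol = REGISTERED.get(captionist_id) or detect_protocol(raw)
    if protocol is None:
        raise ValueError("unrecognized captioning protocol")
    return PARSERS[protocol](raw)
```

A registered captionist is resolved through the registry; an unregistered captionist is resolved by monitoring the incoming text for a known marker.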

D. Processors 

As shown in Figure 3, multimedia information is received by a processor through a
conference bridge and/or from a speech-to-text converter and converted to an appropriate format to
allow useful delivery to one or more viewers. For example, in some embodiments of the present
invention, streaming media is used to provide audio, video, and text to viewers. In such
embodiments, the processor encodes one or more information streams from the audio and/or video
information of the multimedia information. The processor also encodes (e.g., separately) a text
stream. The text and multimedia information are then sent, directly or indirectly, to one or more
viewers.

Prior to delivery to viewers, the media and/or text information may be further processed, as
desired. For example, in some embodiments, text is translated using any suitable language
translator system (e.g., foreign language real-time translation software such as SysTran). In some
embodiments, where text is being sent in real time to viewers, each sentence is translated before
sending the individual words of the sentence to the viewer. This allows for grammatically accurate
translations. For live events, translated text is refreshed at one or more intervals to update the
translated information received by a viewer during the live event.
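The sentence-at-a-time translation described above may be sketched, for illustration, as a generator that buffers incoming words until a sentence boundary is reached. The `translate` callable is a hypothetical stand-in for a real-time translator, and treating a trailing period, question mark, or exclamation point as the sentence boundary is an assumption of this sketch.

```python
from typing import Callable, Iterable, Iterator

def sentence_buffered(words: Iterable[str],
                      translate: Callable[[str], str]) -> Iterator[str]:
    """Hold words until a sentence ends, translate the whole sentence,
    then release the translated words one at a time."""
    buffer = []
    for word in words:
        buffer.append(word)
        if word.endswith((".", "?", "!")):  # assumed sentence boundary
            yield from translate(" ".join(buffer)).split()
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield from translate(" ".join(buffer)).split()
```

Because the translator sees a full sentence rather than isolated words, word order and agreement in the target language can be resolved before anything is sent to the viewer.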

During the development of the present invention, it was determined that applying text
directly from a speech-to-text converter often did not provide sufficient text quality to allow
accurate translations. To address this problem, a series of experiments was performed. It was
determined that a three-step process could be applied to generate text that provides accurate
translations. The first step applies a capitalization check to determine if proper nouns are
capitalized. This step is conducted by 1) determining if a candidate word appears in a dictionary
of a spell checking software application (e.g., MICROSOFT WORD 2002); if not, assign a
positive score; 2) checking the neighboring words on either side of the candidate word to
determine if they are capitalized; if so, assign a positive score; and 3) determining if the candidate
word appears in a dictionary of a spell checking software application as a proper noun; if so, assign
a positive score. If either the first and second or second and third factors result in a positive score,
the candidate word is capitalized. If only one of the factors results in a positive score,
capitalization is dependent on the nature of the source of the text. For text that is considered
"high" in proper nouns (e.g., source of the text is a news broadcast), the candidate word is
capitalized even if only one of the factors results in a positive score. Scoring system intelligence
may be applied based on experience with types of text (e.g., political speech, corporate speech,
educational speech, entertainment content) or with text from a specific individual. This scoring
system is developed, for example, through empirical testing, weighing each of the factors at the
appropriate level to achieve the most accurate results (e.g., for a specific individual, factor one
may be assigned a +1 [not in the dictionary] or 0 [in the dictionary] and given a multiplier score of
1.5; factor two may be assigned a +1 [neighboring word is capitalized] or -1 [neighboring words
are not capitalized] and given a multiplier score of 0.8; factor three may be assigned a +1 [appears
as a proper noun in the dictionary] and given a multiplier score of 2.0; with a positive sum of the
three factors resulting in the selection of the capitalized version of the candidate word). Scoring
system intelligence may be stored in a database for use in automatically assigning the appropriate
intelligence scoring system to the specific individual or type of speech being translated. The
identity of the source of the speech can be determined, for example, upon login.
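The weighted scoring example given above may be worked through as follows. The multipliers 1.5, 0.8, and 2.0 and the factor values are taken from the example in the text; assigning factor three a value of 0 when the word is not listed as a proper noun is an assumption of this sketch, as the text states only the +1 case. The function names are illustrative.

```python
def capitalization_score(in_dictionary: bool,
                         neighbor_capitalized: bool,
                         listed_as_proper_noun: bool,
                         multipliers=(1.5, 0.8, 2.0)) -> float:
    """Weighted sum of the three capitalization factors."""
    factor_one = 0 if in_dictionary else 1           # +1 when not in the dictionary
    factor_two = 1 if neighbor_capitalized else -1   # -1 when neighbors are lowercase
    factor_three = 1 if listed_as_proper_noun else 0 # assumed 0 when not listed
    m1, m2, m3 = multipliers
    return m1 * factor_one + m2 * factor_two + m3 * factor_three

def should_capitalize(in_dictionary: bool,
                      neighbor_capitalized: bool,
                      listed_as_proper_noun: bool) -> bool:
    """A positive sum selects the capitalized version of the candidate word."""
    return capitalization_score(in_dictionary, neighbor_capitalized,
                                listed_as_proper_noun) > 0
```

For instance, a word absent from the dictionary with lowercase neighbors scores 1.5 - 0.8 = 0.7 and is capitalized, while an ordinary dictionary word with lowercase neighbors scores -0.8 and is left unchanged.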
The capitalization-checked text is then applied to the second step. The second step applies
a spell checking software application (e.g., MICROSOFT WORD 2002) for general spell
checking. For automated systems, if the software application indicates an incorrect spelling and a
suggested spelling is available, the highest probability suggested spelling is selected.

The spell-checked text is then applied to the third step. The third step applies a grammar
checking software application (e.g., MICROSOFT WORD 2002) for general grammar checking.
For automated systems, changes are only made if a suggested correction is available. Thus, items
such as contractions (converted into non-contracted forms), spacing, and punctuation are
corrected. Text that has undergone all three steps is then ready for translation. In preferred
embodiments, where any change is made in the text during any of the steps, a log is created,
documenting the changes to allow concurrent or later inspection (e.g., to allow manual correction
of missed errors, to cancel erroneously changed text, and/or to track the effect of changes in the
correction protocol).
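The three-step preparation with change logging may be sketched as a simple pipeline. The three step functions below are trivial hypothetical stand-ins for the capitalization, spelling, and grammar checkers (which in practice would be supplied by checking software); only the ordering of the steps and the logging of each change reflect the description above.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], str]]

def prepare_for_translation(text: str, steps: List[Step]) -> Tuple[str, List[Dict[str, str]]]:
    """Apply each step in order and record every change in a log."""
    log: List[Dict[str, str]] = []
    for name, step in steps:
        new_text = step(text)
        if new_text != text:
            log.append({"step": name, "before": text, "after": new_text})
        text = new_text
    return text, log

# Hypothetical stand-ins for the three checkers.
def fix_capitals(text: str) -> str:
    return text.replace("chicago", "Chicago")

def fix_spelling(text: str) -> str:
    return text.replace("teh", "the")

def fix_grammar(text: str) -> str:
    return text.replace("don't", "do not")

STEPS: List[Step] = [
    ("capitalization", fix_capitals),
    ("spelling", fix_spelling),
    ("grammar", fix_grammar),
]
```

Each log entry retains the text before and after the change, so that a change may later be inspected or cancelled.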

During the development of the present invention, it was determined that audio from a
multimedia event would often be too low in level for use in encoding streaming audio for use by a
viewer. To compensate, in some embodiments of the present invention, audio amplification is
applied to the audio information prior to encoding the information into an information stream.
Likewise, during the development of the present invention, it was determined that audio
information should be delayed so that alignment of text information and audio streams can be
properly carried out. Audio amplification and delay alignment of multimedia information with
text information can be carried out by the processor or by systems connected to the processor (e.g.,
analog or digital amplifiers and delays positioned between the conference bridge and processor).
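Where amplification and delay are carried out digitally, the two operations may be sketched as below. The sketch assumes 16-bit signed PCM samples held in a list and a delay implemented by prepending silence; the gain value, sample rate, and delay length are illustrative parameters, not values taken from the present description.

```python
from typing import List

def amplify(samples: List[int], gain: float, limit: int = 32767) -> List[int]:
    """Scale each sample by the gain, clipping to the 16-bit signed range."""
    return [max(-limit - 1, min(limit, int(round(s * gain)))) for s in samples]

def delay(samples: List[int], delay_ms: int, rate_hz: int) -> List[int]:
    """Delay the audio by prepending the equivalent number of silent samples."""
    silent = delay_ms * rate_hz // 1000
    return [0] * silent + list(samples)
```

The delay gives the text path (captioning and any translation) time to catch up, so that the text and audio streams can be aligned before delivery.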

The efficiency of the processor may be monitored and controlled by a resource manager
(e.g., Robo-Cop, Expert Systems). In some embodiments, the resource manager comprises a
software program provided on a computer system of the present invention. For example, a
software application that performs component failure recoveries and optimizes the receipt and
transmission of data (e.g., streaming information) may be used. In some embodiments of the
present invention, backup hardware and software components are provided with the system. If the
resource manager detects a problem with hardware or software, the backup system is engaged.
During the development of the present invention, it was found that resource management was
required to provide scalability to allow a large number of multimedia events to be processed
simultaneously. Without the resource manager, operation had to be conducted using human labor,
making the process unacceptably inefficient. In particular, management of resource allocation,
resource balancing, and component-failure recovery was needed, wherein the resource manager
automatically assigns tasks and allocations to processor components and automatically performs
component recoveries.
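The engagement of a backup component upon detection of a problem may be sketched as follows. This is a minimal illustration only and does not describe the Robo-Cop resource manager itself; the class name, the component identifiers, and the `is_healthy` callable are assumptions introduced for the sketch.

```python
from typing import Callable, Optional

class ResourceManager:
    """Tracks an active component and switches to a backup on failure."""

    def __init__(self, primary: str, backup: Optional[str],
                 is_healthy: Callable[[str], bool]) -> None:
        self.active = primary
        self.backup = backup
        self.is_healthy = is_healthy  # hypothetical health probe

    def check(self) -> str:
        """Engage the backup if the active component reports a problem."""
        if not self.is_healthy(self.active) and self.backup is not None:
            self.active, self.backup = self.backup, None
        return self.active
```

In use, `check` would be called periodically for each encoder or server so that recovery requires no human intervention.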

In some preferred embodiments, the audio information from a media event is received by
the processor (e.g., through a multi-link conference bridge / Lucent analog port, with amplification
and delay). This information is then converted into streaming information in two different
formats, MICROSOFT (a first format) and REAL (a second format), using separate encoders
(other formats such as QuickTime may be implemented). In preferred embodiments, the processor
has a dedicated sound card for each of the encoders. The encoded information is then available to
send to MICROSOFT and REAL streaming servers, for ultimate delivery to viewers. Optionally,
digital rights management (DRM) encryption can be applied to the information (e.g., for the
Microsoft encoded media stream). Text information sent from a speech-to-text converter is
received by a text processor/broadcaster. The text is translated to the desired language(s) and
encoded in a streaming format for delivery (e.g., simultaneous delivery) to one or more FTP
servers and/or directly to the viewers. For example, in some embodiments, text is streamed to the
viewers by a process using multiple IP sockets (a different socket for each translated language and
one for English). The current accumulative complete transcript is sent at preset time intervals to
the selected FTP server(s) (one copy of the transcript for each translated language and the original
English).

Access to and control of the processor and/or the conference bridge can be limited to
system administrators through the use of security protocols. For example, it is sometimes
desirable to prevent viewers from having access to and control of the processors or conference
bridge. Where the processors and/or conference bridge are controlled remotely, a software
application that provides password-based access to the control operations is provided.

The processor may be configured to run any number of additional tasks including, but not
limited to, acting as a web or information server, allowing data storage and management,
providing streaming servers, and allowing storage and downloading of software applications (e.g.,
multimedia players, text viewers, etc.). Any one or more of the processor functions may be
provided with a single processor (e.g., in a single computer) or with a plurality of processors.
Where multiple processors are used, the processors may be in electronic communication with one
another through, for example, direct connections, local area networks, and/or long distance
electronic communications networks.
In some embodiments of the present invention, the systems and methods are scaled to
allow large volumes of multimedia and text information to be processed. In some large-scale
systems, the processor is split into two different functions, each of which may be located on
different computers and/or in different physical locations. The first portion is a captioning server 
that receives text information and processes the text information as desired (e.g., translates text 
information). The second portion is a text streaming server that transmits streaming text to 
viewers. In some embodiments, the captioning server is managed by a service provider and a 
plurality of text streaming servers are in the possession of one or more customers of the service 
provider. Such a system allows the service provider to handle extremely large volumes of text 
information and transmit the text information to dedicated text streaming servers for each of its 
customers. The text streaming servers may then individually send large numbers of text streams to 
multiple viewers. An illustrative example of such a system is provided below. 

1. Example 

The following example provides a highly scalable Web-based text-streaming platform.
This system is built using MICROSOFT development tools, including .NET and BIZTALK Server
2000. The system uses MICROSOFT'S SQL Server 2000 as the underlying database and
WINDOWS 2000 server products for foundational infrastructure support. In tests conducted during
the development of the present invention, the system benchmarked 4000 continuous text streams
on a single 600 MHz server. Each text stream only consumes 300 bps of bandwidth. Thus, over
5000 streams can be achieved on one T-1 (1.544 Mbps). The system was designed using a multi-threaded
model to support an unlimited number of concurrent multi-user live events, and to
support many hundreds of thousands to millions of viewers for any one event. The system
provides and may be integrated with many beneficial features including, but not limited to:

1) Integration with a small (e.g., less than 7k) Java applet text stream player, which
can be configured to automatically download in a few seconds to a viewer's
computer without requiring any installation. The player can be integrated into any
web page with a few lines of simple HTML code.

2) A fully functional multi-stream server engine is readily installed into any hosting or
customer infrastructure, while the captioning server is maintained by a service
provider. This allows the service provider to maintain control over content and
allows maximum flexibility to meet customer requirements.

3) The text stream can be delivered independent of any other media (audio/video)
streaming technology, and can run alongside any media player (e.g., REAL,
MICROSOFT, QUICKTIME).

4) Unlimited live events.

5) Unlimited event viewers, per one single event.

6) 200 live captionists supported by a single caption server, with the system operating
with unlimited numbers of caption servers, as desired.

7) About 10,000 live text streams supported by a single text broadcast server (e.g.,
using a 1 GHz dual processor).
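The bandwidth figure quoted in this example can be checked with simple arithmetic. The sketch below assumes the nominal T-1 line rate of 1.544 Mbps and the stated 300 bps per text stream, and ignores protocol overhead (an assumption of the sketch, not a statement about the deployed system).

```python
T1_BPS = 1_544_000   # nominal T-1 line rate in bits per second
STREAM_BPS = 300     # per-stream bandwidth stated in the example

def max_streams(link_bps: int, stream_bps: int) -> int:
    """Whole number of streams that fit in the link, ignoring overhead."""
    return link_bps // stream_bps

def aggregate_bps(streams: int, stream_bps: int) -> int:
    """Total bandwidth consumed by a given number of streams."""
    return streams * stream_bps

if __name__ == "__main__":
    print(max_streams(T1_BPS, STREAM_BPS))   # streams per T-1
    print(aggregate_bps(4000, STREAM_BPS))   # load of the benchmarked 4000 streams
```

The quotient is 5146 streams, consistent with "over 5000 streams" on one T-1, and the 4000 benchmarked streams amount to 1.2 Mbps.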

In practice, audio is acquired from a live event at the venue of the live event using a
standard public switched network dial phone or wireless system. The audio is delivered to a
speech-to-text converter and the processor of the present invention using a conference bridge. The
encoded event is used for direct transmission to viewers or for subsequent on-demand replay with
streaming text. In some embodiments, a captionist listens to live audio and captions every word
using a standard court reporter stenograph machine connected to a personal computer running
captioning software. An interface of the present invention delivers the stream of text from the
captionist to the streaming server. English text is filtered with a grammar correction module and is
optionally sent to a real-time language translator. The streaming server delivers the text stream to
a text stream player located on a viewer's computer. The text stream player allows controls such as
text security, language selection, font, color, character size, and viewer interaction (described in
detail below). The process is fully automated and controlled using Expert Systems techniques
provided by MICROSOFT'S BIZTALK Server 2000, for automated process setup and component
failure recoveries.

E. Information Flow to and from Viewers 

Multimedia and text information is received by viewers through any suitable
communication network including, but not limited to, phone connections, the Internet, cable
connections, satellite transmissions, direct connections, and the like. A playback device of a
viewer receives multimedia and text information. For example, where multimedia information is
sent in MICROSOFT or REAL streaming format, viewers access the appropriate streaming server
and receive streaming information that is played by a MICROSOFT media or REAL media
player software application on a playback device (e.g., computer, personal digital assistant (PDA),
video monitor, television, projector, audio player, etc.). Text information may also be received
using any application that can receive and display (e.g., separately) both multimedia and text
information (e.g., using a streaming text Java applet). In some embodiments of the present
invention, text box display software (e.g., SPECHE BOX from SPECHE COMMUNICATIONS)
is provided to the viewer. The present invention contemplates the use of software to add
text-viewing capabilities to preexisting media player software or to provide a stand-alone text viewer
(e.g., using a text streaming Java applet) to be used separately but in conjunction with a media
player.

An example of a media player that finds use with the present invention is shown in Figure
4. This media player contains a viewer screen for viewing video and a separate text box. Figure 4
shows the use of the media player in conjunction with the motion picture "Sleepless in Seattle."
The video and audio are controlled by the panel under the video screen that allows for starting,
stopping, fast forward, reverse, and volume control. The text box displays the name of the
speakers, or their title, and provides a text transcript of their spoken audio. Controls under the text
box allow the text to be viewed in different languages and allow the audio to be changed to the
language selected. The viewer using the media player can select the option "view transcript,"
which opens a separate text box containing the current accumulative transcript in the language
selected. This text box can be configured to allow text to be edited, copied, printed, searched, and
otherwise manipulated. The top of the media player also includes a box for the viewer to enter
comments/questions and send them back to a question queue on the database. The present
invention provides a web-based control for event screening, approval, and prioritizing of
viewer-entered comments/questions. In this case, comments/questions are entered as text and are
processed through the systems of the invention, although they could also be sent as voice-over-IP
audio, public switched network (telephone) audio, email, or in any other desired format. The
systems of the present invention are also configured to allow other viewers to view event-approved
comments/questions.

In some embodiments, language translation is applied to the question/comment
information. For example, in some embodiments, text entered by each viewer is translated to the
native language of the screener at the event (to facilitate accurate control and screening). All text
in the question queue on the database (originally entered by viewers in many different languages)
is translated to each viewer's "Selected Language" and refreshed to their browsers as the screener
processes new text. In this way, each viewer deals with all information (audio, streaming script
text, completed or accumulative transcripts, and comments/questions) in a selected (preferred)
language.
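The presentation of the question queue in each viewer's selected language may be sketched as follows. The `Question` record, its field names, and the `translate` callable (taking text, source language, and target language) are hypothetical and introduced only for illustration; showing only screener-approved entries to viewers follows the description above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    viewer_id: str
    language: str          # language in which the viewer entered the text
    text: str
    approved: bool = False # set by the screener

def render_queue(queue: List[Question], selected_language: str,
                 translate: Callable[[str, str, str], str]) -> List[str]:
    """Return the approved questions in the viewer's selected language."""
    rendered = []
    for q in queue:
        if not q.approved:
            continue  # unapproved entries are visible only to the screener
        if q.language == selected_language:
            rendered.append(q.text)
        else:
            rendered.append(translate(q.text, q.language, selected_language))
    return rendered
```

The same function, called with the screener's native language and without the approval filter, would serve the screening view.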

Figure 5 shows one example of a system configuration of the present invention. Audio
information is passed from a conference bridge to a speech-to-text converter. The multimedia
information from the conference bridge and the text information from the speech-to-text converter
are sent to a processor where the media and text are separately encoded into streaming
information. The processor is connected to a web server (e.g., a web server comprising FTP, IIS,
and C52K servers), databases, and streaming media servers through a network (e.g., a local area
network (LAN)). Streaming audio and video information are sent from the processor to the
streaming media server and streaming text is sent to a Java applet running on the viewer's browser.
A media player (e.g., custom SPECHE BOX software with embedded media player, SPECHE
COMMUNICATIONS) viewable by a viewer receives the text and multimedia information and
displays the multimedia performance and text to a viewer. The viewer can opt to "view
transcript," which sends a request to an FTP server to supply the full transcript (e.g., the full
transcript as generated as of the time the viewer selected the option) to the viewer. The viewer can
also send information (e.g., comments/questions) back to the processor. In the embodiment shown
in Figure 5, a data control system (e.g., one or more computers comprising a processor and/or
databases) allows the viewer to register, provides schedule information on the event, and receives
viewer question information. Storage of viewer information in a database at registration allows
viewer preferences to be determined and stored so that delivered content is correct for each
individual. Customer registration and event scheduling information is also stored in the database
to automate and control event operations using the Robo-Cop (Expert System), and to administer
the transaction / business relationship.

II) Applications

A number of exemplary applications of the systems and methods of the present invention
are provided below.

A. Foreign Language Motion Pictures

When a major motion picture is made in an English-speaking country and is to be released in a
non-English-speaking country, the English dialogue has to be replaced with the language of the
country that the film will be screened in. It is to the film company's advantage to release the
"Foreign Version" as soon as possible after the release of the film in "Domestic Version."
Foreign versions generally cannot be released at the same time as the domestic version because
the film director typically continues to edit the film right up to the last day before the sound track
is sent to the laboratory for processing. In today's motion picture business, the movie is
completed about ten days before the release date. Once the film is completed, a new sound track
is made that does not have any dialogue in it (i.e., it is a version with only music and effects).
This copy, known as an "M & E," is sent to every foreign territory. It is played for a translator
who writes a script for the finished film. New dialogue is recorded in the foreign language to best
match the script and the lip movement of the original actors on the screen. The new dialogue is
then mixed into the M & E and a new sound track is created. Foreign prints are made and the
film is released to theatres. To help speed up the process, any reels of the film that the director
says will not be re-edited are sent to the foreign territories along with a temporary mix of the
sound before the picture is finished. However, the director will usually re-edit the reels that were
previously designated as complete. Some of the new dialogue recordings will not be used and
some will have to be re-done when the film is finished. This process adds delays. The sound has
to be re-edited and re-mixed in the foreign language to make up for the changes. In the current
system, every change has to be shipped overseas, go through customs, and be delivered to the
sound studio. This can take up to a week for every change.

Using the systems and methods of the present invention, time and cost are significantly
reduced. The systems of the present invention allow multimedia information to be transferred
over the Internet. For example, using the systems of the present invention, text translations are
readily made and synchronized to the video and "M&E" audio. This is important because the
length of the film cannot vary from the original by more than + or - 1/48th of a second and the
sound and picture cannot vary more than + or - 1/48th of a second from each other. The systems
of the present invention allow delivery of a script with every sound change and allow a
synchronized product to be available in less than a day. Moreover, a text file of all dialogue can
be provided, as required by the industry.

Thus, the systems and methods of the present invention provide a comprehensive
Internet-based solution that enables overseas territories to efficiently and timely re-dub motion pictures in
domestic languages. Throughout the iterations of a motion picture's development, the audio,
video, and corresponding text are distributed overseas online, eliminating logistical bottlenecks
associated with sending physical reels and the time associated with waiting for transcriptions.
The product can be delivered promptly in under a day and in multiple languages.

A similar process can be applied to provide translated text (e.g., subtitles) for television
programming or any other multimedia presentation where it may be desirable to have language
translations applied (e.g., video presentations on airlines). One embodiment for video translation
and sub-titling is shown in Figure 8. In this figure, an original video with audio in a first language
(e.g., English) is processed into encoded audio and video (e.g., in .WMA and .WMV file formats).
In some embodiments, encoded audio and low quality encoded video are sent (e.g., via Web FTP)
to a conference bridge of the present invention, where audio is converted to text by a
speech-to-text converter and translated by a language translator using methods described above. The
translated text (e.g., in the form of a translated script) is then sent to a foreign territory where the
translated information is used to re-dub the video with foreign language voice over. Text
information (in one or more different languages) may also be sent to a video studio to prepare
sub-titles in any desired language (e.g., as a final product or for preparing an intermediate video to be
sent to the foreign territory to prepare a re-dubbed video). The physical location of any of the
systems does not matter, as information can be sent from one component of the system to another
over communication networks.

B. Transcripts of News Events and Business and Legal Proceedings 

Many newsworthy events (e.g., political speeches, etc.), business proceedings (e.g., board
meetings), and legal proceedings (e.g., trials, depositions, etc.) benefit from or require the
generation of text transcripts (and optional translations) of spoken language. The systems and
methods of the present invention provide means to generate real-time (or subsequent) text
transcripts of these events. The text transcripts can be provided so as to allow full manipulation of
the text (e.g., searching, copying, printing, etc.). For example, news media personnel can receive
real-time (or subsequent) transcripts of newsworthy speeches, allowing them to select desired
portions for use in generating their news reports. A major advantage of using the systems and
methods of the present invention is that the user of the text information need not be present at the
location where the event is occurring. Virtual business meetings and legal proceedings are
possible, where each of the participants receives a real-time (or subsequent) copy of the text of the
proceeding, as it occurs. Non-live event transcripts/translations are created after the audio from a
prior live event has been recorded for subsequent playback for transcription and translation by a
captionist/transcriptionist. One embodiment of such an application is illustrated in Figure 7. A
potential corporate customer registers (and is approved) on a web site and pre-buys a block of
minutes (or hours) of transcription (and optionally translation) services. During a corporate
meeting (e.g., Board Meeting), the meeting chairperson (e.g., on a quality speakerphone) calls into
the systems of the present invention and enters their service access code for the transcription /
translation services pre-purchased. The meeting participants conduct a normal meeting, speaking
their name prior to participation. At the end of the meeting, the chairperson simply hangs up the
phone. Within a required duration (predetermined as a service option), the transcripts (in selected
languages) are e-mailed or otherwise delivered to the designated address (or made available on a
secured web site). The customer's account is decremented, and they are notified when service
time reaches a pre-determined balance. This service would also make the recorded audio available
in the original (and optionally translated) languages.
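The decrementing of a pre-purchased block of minutes, with notification at a pre-determined balance, may be sketched as follows. The class and attribute names are hypothetical, and rejecting a charge that exceeds the remaining balance is an assumption of this sketch; the description above does not specify how an exhausted account is handled.

```python
class PrepaidAccount:
    """Tracks a pre-purchased block of transcription minutes."""

    def __init__(self, minutes: int, notify_at: int) -> None:
        self.minutes = minutes      # remaining pre-purchased minutes
        self.notify_at = notify_at  # pre-determined notification balance

    def charge(self, minutes_used: int) -> bool:
        """Decrement the balance; return True when the customer should be notified."""
        if minutes_used > self.minutes:
            # Assumed policy: refuse to overdraw the account.
            raise ValueError("insufficient prepaid minutes")
        self.minutes -= minutes_used
        return self.minutes <= self.notify_at
```

For example, an account holding 120 minutes with a notification balance of 30 is not flagged after a 60-minute meeting, but is flagged after a further 35-minute meeting leaves 25 minutes.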

The systems and methods of the present invention may also be integrated with presentation
software (e.g., MICROSOFT POWERPOINT) to facilitate information exchange during
presentations or demonstrations. For example, live or prerecorded POWERPOINT presentations
are integrated with the streaming text and/or multimedia systems of the present invention to allow
added information content to the slides presented in the POWERPOINT presentation. In some
embodiments, viewers (e.g., participants at a business conference) can access the POWERPOINT
presentation over the web and view the images (moving back and forth as desired) as they desire.

C. Internet Broadcasting 

The Internet has become a primary source of information for many people and provides a
means for providing up-to-date information globally. Unlike radio, television, and satellite
transmissions, the Internet is not limited to a finite number of "channels." Thus, a user can obtain
news and information from obscure sources and locations that would not otherwise be available.
The systems and methods of the present invention allow efficient and flexible broadcasting of
information over the Internet, particularly for live events and for diverse groups of users who may
have limited access to audio and video monitoring devices and who may speak a wide range of
languages. With the systems of the present invention, real-time streaming text, as well as audio
and video, is provided to users. The text and audio are selected to match the language of the user.
A complete transcript is made available online upon the close of the event with view/print
function, highest quality, automated translations into a dozen foreign languages, cut and paste
capabilities, and key word search function with a complete transcript time stamping function for
exact synchronization between text and audio.
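A key word search over a time-stamped transcript may be sketched as follows. Representing the transcript as a list of (seconds from start, text) pairs is an assumption made for illustration; the returned time stamps are what would allow a player to seek the audio to the matching passage.

```python
from typing import List, Tuple

Entry = Tuple[float, str]  # (seconds from start of event, transcript text)

def search_transcript(transcript: List[Entry], keyword: str) -> List[Entry]:
    """Return every time-stamped entry containing the key word (case-insensitive)."""
    needle = keyword.lower()
    return [(stamp, text) for stamp, text in transcript if needle in text.lower()]
```

A matching entry's time stamp can be passed directly to the media player as a seek position.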

D. Interactive Events 

5 The systems and methods of the present invention provide for interactive events involving 

viewers located in different areas. These interactive events include talk-show formats, debates, 
meetings, and distance learning events. In some embodiments, interactive events are conducted 
over the Internet An example of a talk-show format is provided in Figure 6. An event moderator 
can control the system through a web-based interface so that participants need not be burdened 

10 with equipment shipping, training, and maintenance. Participants can be anywhere in the world 
allowing for virtual web debates, distance instruction and education in which interaction is critical 
to the learning process, and intra-organizational communication within large organizations with 
multiple offices in various foreign countries. Any event that can benefit from question and answer 
interactivity with an offsite audience finds use with the systems and methods of the present 

15 invention. Participant questions can be directed over the telephone or typed as in a chat format and 
can be viewed by all other participants in real time and/or after the fact. The systems and methods 
of the present invention provide dramatic flexibility for involving participants who speak different 
languages. The systems and methods of the present invention translate all viewer comments and 
questions from their selected language to that of the screener (or moderator) to facilitate screening 

20 and prioritizing. All comments and questions entered (and approved by the screener) in various 
languages by all viewers are translated to the selected language of each viewer. This approach 
insures that all viewers gain the greatest benefit from an event, by interacting in their selected 
language for: streaming transcript, accumulative complete transcripts, audio dialogue, and 
comments / questions entered and received. In the embodiment shown in Figure 6, the web 

25 presenter accesses a database of the present invention to register and schedule the event. The 
database can also be used to store an image file of the presenter, presentation files (e.g., 
POWERPOINT presentation files), and a roster of information pertaining to invited participants. 
The information in the database may be updated during the presentation. For example, questions 
from viewer participants and responses may be stored on the database to allow them to be viewed 
at the request of any of the participants. Questions from viewer participants may be received 
aurally using voice-over IP technology. These questions are directed to the conference bridge, 
with the audio being converted to text by a speech-to-text converter and the text information 
and/or corresponding audio information being routed to a processor for encoding as text and/or 
multimedia information streams, as well as storage in the database. At the request of any 
participant, the questions may be viewed as text and/or audio in any desired language. 
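
The screening and translation flow described above can be illustrated with a minimal sketch. It is not the implementation of the invention: the class names, the method names, and the stub_translate function are hypothetical, and the stub only tags text with a language pair where a real system would call a translation engine. The sketch shows a question being translated into the screener's language for review and, once approved, being rendered in the selected language of each viewer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical translator signature: (text, source_language, target_language) -> text.
Translator = Callable[[str, str, str], str]


def stub_translate(text: str, source: str, target: str) -> str:
    """Placeholder for a real translation engine (an assumption of this sketch)."""
    if source == target:
        return text
    return f"[{source}->{target}] {text}"


@dataclass
class Question:
    author: str
    language: str
    text: str
    approved: bool = False


@dataclass
class InteractiveEvent:
    screener_language: str
    translate: Translator = stub_translate
    viewer_languages: Dict[str, str] = field(default_factory=dict)
    pending: List[Question] = field(default_factory=list)

    def register_viewer(self, viewer: str, language: str) -> None:
        """Record the language a viewer selected for the event."""
        self.viewer_languages[viewer] = language

    def submit(self, author: str, text: str) -> str:
        """Queue a question and return it in the screener's language."""
        question = Question(author, self.viewer_languages[author], text)
        self.pending.append(question)
        return self.translate(text, question.language, self.screener_language)

    def approve(self, question: Question) -> Dict[str, str]:
        """Mark a question approved and render it for every registered viewer."""
        question.approved = True
        return {
            viewer: self.translate(question.text, question.language, language)
            for viewer, language in self.viewer_languages.items()
        }


event = InteractiveEvent(screener_language="en")
event.register_viewer("ana", "es")
event.register_viewer("bob", "en")
for_screener = event.submit("ana", "Hola a todos")
delivered = event.approve(event.pending[0])
```

Here the author sees the question unchanged, while the screener and the English-speaking viewer receive the translated form. Storing the original text and translating on delivery is one possible design choice; it keeps a single stored copy of each question regardless of how many languages are in use.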

E. Text Transcriptions for the Hearing and Vision Impaired 

Hearing impaired individuals currently have access to closed captioning systems for use in 
conjunction with a limited number of movie and televised events. The systems and methods of the 
present invention provide superior resources for hearing impaired individuals, providing complete, 
cumulative text representations of audio events and allowing fully functional text for Internet 
multimedia events. With closed captioning technologies, words appear briefly on a viewer's 
screen, and are then gone. The systems and methods of the present invention allow aggregation of 
words into a complete document that can be made available in its entirety, in any desired language, 
during an event and/or at the end of events. The systems and methods of the present invention 
provide hearing impaired individuals access to Internet broadcasting events including, but not 
limited to, financial information, live news coverage, and educational content. At present, the 
hearing impaired community is being left out of the Internet broadcasting movement. The systems 
and methods of the present invention fill this gap, allowing hearing impaired users, as well as vision 
impaired users, to automatically select the desired formatting (font size, style, color, text language) 
for their needs. 
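
The difference between transient captions and an aggregated document can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the invention: the class and field names are hypothetical, and the language preference is only recorded here, not used to translate the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DisplayPreferences:
    """Viewer-selected formatting; the field names are illustrative assumptions."""
    font_size: int = 12
    style: str = "normal"
    color: str = "black"
    language: str = "en"


@dataclass
class CumulativeTranscript:
    segments: List[str] = field(default_factory=list)

    def append(self, text: str) -> None:
        """Add newly converted text without discarding earlier text."""
        self.segments.append(text.strip())

    def latest(self) -> str:
        """Return only the most recent segment, comparable to a caption line."""
        return self.segments[-1] if self.segments else ""

    def full_text(self) -> str:
        """Return the complete document accumulated so far."""
        return " ".join(self.segments)

    def render(self, prefs: DisplayPreferences) -> dict:
        """Pair the full text with the viewer's formatting choices."""
        return {
            "text": self.full_text(),
            "font_size": prefs.font_size,
            "style": prefs.style,
            "color": prefs.color,
            "language": prefs.language,
        }


transcript = CumulativeTranscript()
transcript.append("Good morning.")
transcript.append(" Earnings rose. ")
large_print = transcript.render(DisplayPreferences(font_size=18, color="yellow"))
```

A caption display corresponds to calling latest() alone, whereas full_text() remains available during and after the event.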

All publications and patents mentioned in the above specification are herein incorporated 
by reference. Various modifications and variations of the described methods and systems of the 
invention will be apparent to those skilled in the art without departing from the scope and spirit of 
the invention. Although the invention has been described in connection with specific preferred 
embodiments, it should be understood that the invention as claimed should not be unduly limited 
to such specific embodiments. Indeed, various modifications of the described modes for carrying 
out the invention that are obvious to those skilled in the relevant fields are intended to be within 
the scope of the following claims. 